Information Retrieval Using Statistical Classification
Authors
Abstract
In the classical information retrieval (IR) problem, the system must find all documents in a collection that are related to a topic defined by a user's query. A common approach to the IR problem is to represent documents and the query as vectors of term frequencies and rank the documents in the collection according to their inner-product similarity with respect to the query. When a sample of evaluated documents is available in addition to the query (often called routing), the problem can be attacked using techniques based on statistical classification. In order for statistical classification to be a feasible approach, the system must produce a relatively small set of high-quality feature variables. It turns out that individual words, due to their quantity and ambiguity, are not optimal features. Previous work has focused on a technique known as Latent Semantic Indexing (LSI), which applies the singular value decomposition to a term-document matrix and represents terms and documents by linear combinations of orthogonal indexing variables. The research presented in this thesis accomplishes the following goals. It provides a thorough discussion of evaluation in information retrieval experiments. It introduces the concept of a local LSI decomposition: LSI is used separately on a set of documents in the local region surrounding each query, creating query-specific feature variables and making the LSI technique feasible for very large document collections. It applies the classification technique known as Discriminant Analysis to the routing problem and presents experimental results on two text collections. It demonstrates that using a local LSI decomposition improves retrieval performance and represents documents using a relatively small number of feature variables. It finds that Discriminant Analysis sometimes leads to additional performance gains, but that more research is needed to determine the optimal size and shape of the local region.
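The ranking and local decomposition steps described in the abstract can be illustrated with a minimal sketch. It assumes NumPy, a toy term-document matrix, and a local region defined simply as the top-ranked documents by inner product. The function names, the `region_size` and `n_factors` parameters, and the data are hypothetical and not taken from the thesis, and the sketch stops before the Discriminant Analysis step.

```python
import numpy as np

def rank_by_inner_product(term_doc, query):
    """Score each document (column) by its inner product with the query."""
    scores = query @ term_doc
    # Stable sort on negated scores: highest score first, ties keep column order.
    order = np.argsort(-scores, kind="stable")
    return order, scores

def local_lsi(term_doc, query, region_size, n_factors):
    """Fit a truncated SVD on the top-ranked documents only (assumed local region)."""
    order, _ = rank_by_inner_product(term_doc, query)
    local_ids = order[:region_size]
    local = term_doc[:, local_ids]                 # terms x local documents
    u, _, _ = np.linalg.svd(local, full_matrices=False)
    basis = u[:, :n_factors]                       # terms x factors, orthonormal columns
    doc_features = basis.T @ term_doc              # factors x documents
    query_features = basis.T @ query               # factors
    return local_ids, doc_features, query_features

# Toy data (hypothetical): 5 terms (rows) by 4 documents (columns).
term_doc = np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 0, 2, 0],
], dtype=float)
query = np.array([1, 0, 0, 0, 1], dtype=float)

order, scores = rank_by_inner_product(term_doc, query)
local_ids, doc_features, query_features = local_lsi(
    term_doc, query, region_size=2, n_factors=2
)
```

Each document is now described by `n_factors` query-specific features instead of one feature per term, and the SVD is computed only on the `region_size` documents nearest the query rather than on the whole collection.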
Similar resources
QEA: A New Systematic and Comprehensive Classification of Query Expansion Approaches
A major problem in information retrieval is the difficulty of defining the user's information needs; on the other hand, when the user submits a query there is a vast amount of information to retrieve. Different methods have therefore been suggested for query expansion, which is concerned with reconfiguring the query to increase efficiency and improve the accuracy criterion in the information...
Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents
Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents' contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...
The socio-cognitive theory in information retrieval (IR)
Background and Aim: The socio-cognitive theory was introduced in information science by Hjørland and Albrechtsen. The socio-cognitive view turns the traditional cognitive program upside down. The socio-cognitive theory emphasizes the different cultural and social structures of users. Hence, the aim of the article is to explain the role of socio-cognitive theory in information retrieval (I...
Text-Based Information Retrieval Using Exponentiated Gradient Descent
The following investigates the use of single-neuron learning algorithms to improve the performance of text-retrieval systems that accept natural-language queries. A retrieval process is explained that transforms the natural-language query into the query syntax of a real retrieval system: the initial query is expanded using statistical and learning techniques and is then used for document rankin...
A statistical approach to crosslingual natural language tasks
The existence of huge volumes of documents written in multiple languages on the Internet has led to the investigation of novel approaches to deal with information of this kind. We propose to use a statistical approach in order to tackle the problem of dealing with crosslingual natural language tasks. In particular, we apply the IBM alignment model 1 with the aim of obtaining a statistical bilingual dictionary ...